We will use Altair to explore the basic concepts of data visualization. Altair is an extremely powerful library that covers a wide range of needs (including fairly advanced interactive functionality). Nevertheless, we are more interested in understanding the process behind a "proper" visualization than in making a fancy visualization for the sake of it. At the end of the Markdown you'll find a selection of links that you can use to learn more about Altair, well beyond what is required in this course.
import altair as alt
from vega_datasets import data
import pandas as pd
Initially we will work with the small iris dataset. You might have seen it before: it contains measurements for three different varieties of iris (the flower).
iris = data.iris()
iris
| | sepalLength | sepalWidth | petalLength | petalWidth | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |
150 rows × 5 columns
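On a general level Altair needs to know three things: which data to use (alt.Chart(iris)), which mark to draw (mark_point()), and how columns map to visual channels such as x, y, and color (encode(...)):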
alt.Chart(iris).mark_point().encode(
x='sepalLength',
y='sepalWidth',
color='species'
)
As you can see, Altair takes care of everything else (colors, legend, axes, etc.). You can read the code as "I want to visualize iris data with a scatterplot where X represents sepalLength, Y represents sepalWidth and the color of the dot represents the species".
Now this visualization already makes some connections between the type of information and how to visualize it.
Compare it with the one below. What is the difference?
alt.Chart(iris).mark_point().encode(
x='sepalLength',
y='species',
color='sepalLength'
)
So there is clearly a connection between the mark you select and the data you need to visualize. At the same time, some mappings can be "purely" aesthetic. Describe this plot: what is visualized? What is not strictly necessary?
alt.Chart(iris).mark_boxplot().encode(
x='species',
y='petalWidth',
color='species'
)
Altair allows you to use transforms, such as the count() and mean() aggregations below, to perform operations on the data directly during the visualization:
alt.Chart(iris).mark_bar().encode(
x='species',
y='count()'
)
alt.Chart(iris).mark_bar().encode(
y='species',
x='mean(sepalLength)'
)
You can read more about the many transforms that are available here: https://altair-viz.github.io/user_guide/transform/index.html
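To make the aggregations concrete, here is a rough pandas equivalent of what the two bar charts compute before drawing. It is only a sketch on a hand-made DataFrame (`toy`, illustrative values), not how Altair performs the computation internally.

```python
import pandas as pd

# Illustrative data: two setosa rows, three virginica rows.
toy = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica", "virginica"],
    "sepalLength": [5.0, 5.2, 6.4, 6.6, 6.8],
})

# Roughly what x='species', y='count()' plots: number of rows per species.
counts = toy.groupby("species").size()

# Roughly what y='species', x='mean(sepalLength)' plots: average per species.
means = toy.groupby("species")["sepalLength"].mean()

print(counts)
print(means)
```

With the aggregation written inside the encoding, you do not need to build these intermediate tables yourself.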
By default Altair infers the data type of each column from the pandas dtype. Nevertheless, there are cases when you need or want to specify the data type yourself. You can do that by adding :LETTER after the column name:
- Q = quantitative data
- N = nominal data
- O = ordinal data
- T = temporal data
- G = geographic shape (GeoJSON)
You can read more about this: https://altair-viz.github.io/user_guide/encodings/index.html#encoding-data-types
alt.Chart(iris).mark_bar().encode(
alt.X("sepalLength:Q", bin=True),  # x is passed as alt.X (not a plain string) because we need to set an additional parameter, bin
y='count()'
)
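The histogram above combines two steps: values are grouped into bins, then the rows in each bin are counted. The pandas sketch below mimics that idea with hand-picked bin edges and illustrative values; Altair chooses its own bin boundaries, so treat the edges here as an assumption made for the example.

```python
import pandas as pd

# Illustrative values only.
values = pd.Series([4.3, 4.9, 5.1, 5.8, 6.3, 6.7, 7.9], name="sepalLength")

# Hand-picked edges for this sketch; intervals are closed on the left.
edges = [4.0, 5.0, 6.0, 7.0, 8.0]
binned = pd.cut(values, bins=edges, right=False)

# Count rows per bin, keeping the bins in increasing order.
counts = binned.value_counts().sort_index()
print(counts)
```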
df = pd.read_csv("data_cleaned.csv")
df.head()
| | Unnamed: 0.1 | Unnamed: 0 | App | Category | Rating | Reviews | Size | Installs | Type | Price($) | Content Rating | Genres | Last Updated | Current Ver | Android Ver | Price_rank | Success | Review |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 8884 | "i DT" Fútbol. Todos Somos Técnicos. | SPORTS | NaN | 27 | 3.6 | 500+ | Free | 0.0 | Everyone | Sports | 2017-10-07 | 0.22 | 4.1 and up | 8906.0 | False | too soon to call |
| 1 | 1 | 8532 | +Download 4 Instagram Twitter | SOCIAL | 4.5 | 40467 | 22.0 | 1,000,000+ | Free | 0.0 | Everyone | Social | 2018-08-02 | 5.03 | 4.1 and up | 8906.0 | True | good |
| 2 | 2 | 324 | - Free Comics - Comic Apps | COMICS | 3.5 | 115 | 9.1 | 10,000+ | Free | 0.0 | Mature 17+ | Comics | 2018-07-13 | 5.0.12 | 5.0 and up | 8906.0 | False | too soon to call |
| 3 | 3 | 4541 | .R | TOOLS | 4.5 | 259 | 203.0 | 10,000+ | Free | 0.0 | Everyone | Tools | 2014-09-16 | 1.1.06 | 1.5 and up | 8906.0 | False | too soon to call |
| 4 | 4 | 4636 | /u/app | COMMUNICATION | 4.7 | 573 | 53.0 | 10,000+ | Free | 0.0 | Mature 17+ | Communication | 2018-07-03 | 4.2.4 | 4.1 and up | 8906.0 | False | too soon to call |
This is a large dataset. By default Altair refuses to visualize more than 5000 rows. We need to disable this limit:
alt.data_transformers.disable_max_rows()
DataTransformerRegistry.enable('default')
alt.Chart(df).mark_bar().encode(
alt.X('Size:Q',bin=alt.Bin(maxbins=50)),
y='count()',
)
alt.Chart(df).mark_boxplot().encode(
x='Success',
y='Rating',
color='Success',
column='Type'
)
alt.Chart(df).mark_line().encode(
x='Last Updated',
y='count()',
color='Content Rating'
)
df.head()
| | Unnamed: 0.1 | Unnamed: 0 | App | Category | Rating | Reviews | Size | Installs | Type | Price($) | Content Rating | Genres | Last Updated | Current Ver | Android Ver | Price_rank | Success | Review |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 8884 | "i DT" Fútbol. Todos Somos Técnicos. | SPORTS | NaN | 27 | 3.6 | 500+ | Free | 0.0 | Everyone | Sports | 2017-10-07 | 0.22 | 4.1 and up | 8906.0 | False | too soon to call |
| 1 | 1 | 8532 | +Download 4 Instagram Twitter | SOCIAL | 4.5 | 40467 | 22.0 | 1,000,000+ | Free | 0.0 | Everyone | Social | 2018-08-02 | 5.03 | 4.1 and up | 8906.0 | True | good |
| 2 | 2 | 324 | - Free Comics - Comic Apps | COMICS | 3.5 | 115 | 9.1 | 10,000+ | Free | 0.0 | Mature 17+ | Comics | 2018-07-13 | 5.0.12 | 5.0 and up | 8906.0 | False | too soon to call |
| 3 | 3 | 4541 | .R | TOOLS | 4.5 | 259 | 203.0 | 10,000+ | Free | 0.0 | Everyone | Tools | 2014-09-16 | 1.1.06 | 1.5 and up | 8906.0 | False | too soon to call |
| 4 | 4 | 4636 | /u/app | COMMUNICATION | 4.7 | 573 | 53.0 | 10,000+ | Free | 0.0 | Mature 17+ | Communication | 2018-07-03 | 4.2.4 | 4.1 and up | 8906.0 | False | too soon to call |
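# MonthEnd(0) rolls each date forward to the end of its own month (MonthEnd(1) would move a date already on a month end into the next month)
df['Month'] = pd.to_datetime(df['Last Updated']) + pd.tseries.offsets.MonthEnd(0)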
df.head()
| | Unnamed: 0.1 | Unnamed: 0 | App | Category | Rating | Reviews | Size | Installs | Type | Price($) | Content Rating | Genres | Last Updated | Current Ver | Android Ver | Price_rank | Success | Review | Month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 8884 | "i DT" Fútbol. Todos Somos Técnicos. | SPORTS | NaN | 27 | 3.6 | 500+ | Free | 0.0 | Everyone | Sports | 2017-10-07 | 0.22 | 4.1 and up | 8906.0 | False | too soon to call | 2017-10-31 |
| 1 | 1 | 8532 | +Download 4 Instagram Twitter | SOCIAL | 4.5 | 40467 | 22.0 | 1,000,000+ | Free | 0.0 | Everyone | Social | 2018-08-02 | 5.03 | 4.1 and up | 8906.0 | True | good | 2018-08-31 |
| 2 | 2 | 324 | - Free Comics - Comic Apps | COMICS | 3.5 | 115 | 9.1 | 10,000+ | Free | 0.0 | Mature 17+ | Comics | 2018-07-13 | 5.0.12 | 5.0 and up | 8906.0 | False | too soon to call | 2018-07-31 |
| 3 | 3 | 4541 | .R | TOOLS | 4.5 | 259 | 203.0 | 10,000+ | Free | 0.0 | Everyone | Tools | 2014-09-16 | 1.1.06 | 1.5 and up | 8906.0 | False | too soon to call | 2014-09-30 |
| 4 | 4 | 4636 | /u/app | COMMUNICATION | 4.7 | 573 | 53.0 | 10,000+ | Free | 0.0 | Mature 17+ | Communication | 2018-07-03 | 4.2.4 | 4.1 and up | 8906.0 | False | too soon to call | 2018-07-31 |
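When bucketing dates by month with a `MonthEnd` offset, the value of `n` matters for dates that already fall on the last day of a month. The sketch below uses two illustrative dates to show the difference between `MonthEnd(1)` and `MonthEnd(0)`.

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

# One mid-month date and one date that is already a month end.
dates = pd.to_datetime(pd.Series(["2018-07-13", "2018-07-31"]))

# MonthEnd(1): move to the next month end strictly after the date.
rolled_n1 = (dates + MonthEnd(1)).dt.strftime("%Y-%m-%d").tolist()

# MonthEnd(0): roll forward to the month end, leaving month-end dates unchanged.
rolled_n0 = (dates + MonthEnd(0)).dt.strftime("%Y-%m-%d").tolist()

print(rolled_n1)
print(rolled_n0)
```

With `MonthEnd(1)` the 31 July row lands in the August bucket, so `MonthEnd(0)` is usually the safer choice when the goal is "the month this date belongs to".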
alt.Chart(df).mark_line().encode(
x='Month',
y='count()',
color='Content Rating'
)
heatmap=alt.Chart(df,
title="Types & Success | categorical heatmap").mark_bar().encode(
x='Success',
y='Type',
color='count()'
).properties(
width=500,
height=500
)
heatmap
stacked=alt.Chart(df,
title="Success | Stacked data").mark_bar().encode(
x='Success',
y='count()',
color='Review'
).properties(
width=500,
height=500
)
stacked
stacked|heatmap
Altair has powerful interactive capabilities. You can read more in the dedicated resources, but here we can still introduce a few basic elements and show how to add basic interactive functionality to your visualizations.
Two concepts are at the core of interactive visualization in Altair:
- Parameters are the basic building blocks in the grammar of interaction. They can either be simple variables or more complex selections that map user input (e.g., mouse clicks and drags) to data queries.
- Conditions and filters can respond to changes in parameter values and update chart elements based on that input.
Let's see an example by adding a parameter to select the data.
brush = alt.selection_interval()
Once we have done that, we can make a specific element of the visualization conditional on the parameter:
alt.Chart(df).mark_point().encode(
x='Reviews:Q',
y='Size:Q',
color=alt.condition(brush,'Success:N',alt.value('lightgray'))
).add_params(brush)
There are multiple ways of selecting:
pointer = alt.selection_point()
alt.Chart(df).mark_bar().encode(
x='Success:N',
y='mean(Size):Q',
color=alt.condition(pointer,'Success:N',alt.value('lightgray'))
).add_params(pointer)
These selections can be combined in many effective ways.
interval = alt.selection_interval(encodings=['x'])
lines=alt.Chart(df).mark_line().encode(
x='Month',
y='count()',
color='Content Rating'
).add_params(interval)
bars=alt.Chart(df).mark_bar().encode(
x='Success',
y='count()',
color='Content Rating'
).transform_filter(interval)
lines | bars
from importlib.metadata import version
version('altair')
'5.0.1'
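The version matters here because `add_params`, `selection_point` and `selection_interval` (as used above) belong to the Altair 5 API; Altair 4 used `add_selection`, `selection_single` and `selection_multi` instead. The helper names below are hypothetical, written only to sketch a simple check on the version string.

```python
def altair_major(version_string: str) -> int:
    """Return the major version number from a string such as '5.0.1'."""
    return int(version_string.split(".")[0])


def uses_params_api(version_string: str) -> bool:
    """True when the version is new enough for add_params and selection_point."""
    return altair_major(version_string) >= 5


print(uses_params_api("5.0.1"))
print(uses_params_api("4.2.2"))
```

In a notebook you could pass the result of `version('altair')` to `uses_params_api` before running the interactive cells.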